
Conversation

@jvsena42 (Member) commented Jan 28, 2026

Fixes #739

This PR prevents channel state divergence during node shutdown by ensuring state is fully persisted before the service is destroyed.

Description

When the app was stopped while a 0-conf channel had uncommitted state updates, the client (LDK) could end up with a different commitment height than the LSP. On reconnect, the LSP detected this mismatch as "possible data loss" and force-closed the channel.

This PR adds two mitigations (a rough sketch follows the list):

  1. Final sync before shutdown - Calls syncWallets() before stopping the node to ensure the latest channel state is persisted to VSS
  2. Blocking service shutdown - Uses runBlocking in onDestroy() to wait for the node to fully stop before the service is destroyed, with a 5-second timeout to avoid ANR
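
A minimal sketch of how these two mitigations could fit together in the service's onDestroy(). The identifiers lightningService, syncWallets(), and stop() are assumptions for illustration, not necessarily the names used in this PR; Logger and TAG follow the style of the snippets quoted later in this thread.

import kotlinx.coroutines.runBlocking
import kotlinx.coroutines.withTimeoutOrNull

// Sketch only, inside the foreground Service. Names below are assumed.
override fun onDestroy() {
    Logger.debug("onDestroy started", context = TAG)
    runBlocking {
        withTimeoutOrNull(5_000L) { // bounded wait to stay clear of an ANR
            runCatching {
                Logger.debug("Performing final sync before shutdown…", context = TAG)
                lightningService.syncWallets() // persist latest channel state to VSS
                Logger.debug("Final sync completed", context = TAG)
            }
            lightningService.stop() // wait for the node to fully stop
        } ?: Logger.warn("Shutdown did not finish within 5s", context = TAG)
    }
    Logger.debug("onDestroy completed", context = TAG)
    super.onDestroy()
}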

Preview

CJIT.webm
multiple-transactions-and-poor-signal.webm

QA Notes

1. Test graceful shutdown

  1. Open the app and create or use an existing Lightning channel
  2. Stop the app using the notification "Stop" button
  3. Check logcat for:
    • Performing final sync before shutdown…
    • Final sync completed
    • onDestroy started
    • onDestroy completed
  4. Verify the node stops without errors

2. Test restart after stop

  1. After stopping the app via notification, reopen it
  2. Verify the Lightning channel is still operational
  3. Verify no force-close occurred

3. Regression

  • Send and receive Lightning payments normally
  • Verify wallet balance updates correctly

@jvsena42 jvsena42 self-assigned this Jan 28, 2026
@jvsena42 jvsena42 marked this pull request as ready for review January 28, 2026 14:48
@jvsena42 jvsena42 requested a review from ovitrif January 28, 2026 14:48
@claude (bot): This comment has been minimized.

@jvsena42 jvsena42 linked an issue Jan 28, 2026 that may be closed by this pull request
@jvsena42 (Member, Author) commented:
Didn't find any problems in my tests, but the timing here is very difficult to reproduce.

@claude (bot): This comment has been minimized.

@ovitrif (Collaborator) left a comment:

nit: Probably still good to use non-main thread

Look at runBlocking(dispatcher) option

@jvsena42 jvsena42 marked this pull request as draft January 30, 2026 11:23
@jvsena42 (Member, Author) commented:
> nit: Probably still good to use non-main thread
>
> Look at runBlocking(dispatcher) option

The execution already switches to bgDispatcher because stop() is declared as suspend fun stop(): Result<Unit> = withContext(bgDispatcher), but the main thread is still blocked waiting for runBlocking to complete.
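
For illustration, a tiny self-contained example of that behaviour (placeholder names, not code from this PR): the suspend work hops to another dispatcher, but the thread that entered runBlocking stays parked until the block returns, and passing a dispatcher to runBlocking would not change that.

import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.delay
import kotlinx.coroutines.runBlocking
import kotlinx.coroutines.withContext

// Stand-in for `suspend fun stop() = withContext(bgDispatcher) { ... }`
suspend fun stopLikeCall() = withContext(Dispatchers.IO) {
    delay(1_000) // pretend this is node.stop()
}

fun main() = runBlocking {
    println("entered runBlocking on: ${Thread.currentThread().name}") // main
    stopLikeCall() // body runs on an IO thread...
    println("resumed on: ${Thread.currentThread().name}") // ...but main waited here the whole time
}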

@jvsena42 (Member, Author) commented Feb 2, 2026:

todo: check iOS

@jvsena42 jvsena42 marked this pull request as ready for review February 2, 2026 13:22
@claude (bot) left a comment:

Code Review

I found one issue that should be addressed:

syncWallets() can consume the entire timeout budget, preventing node.stop() from executing

File: app/src/main/java/to/bitkit/services/LightningService.kt

The new syncWallets() call before shutdown shares the same 5-second timeout budget (from onDestroy) as the critical node.stop() call. Since syncWallets() is a network operation without its own timeout, it could take the full 5 seconds, leaving no time for node.stop() to execute.

This is problematic because:

  1. syncWallets() is a network call to VSS that can be slow or hang
  2. If it takes most/all of the 5 seconds, node.stop() never runs
  3. This leaves the node in a running but unmanaged state - worse than the original problem

In onTimeout(), the budget is only 2 seconds, making this even more likely to occur.

Suggested fix: Give syncWallets() its own sub-timeout (e.g., 2 seconds out of the 5-second budget) to guarantee node.stop() has time to execute:

runCatching {
    Logger.debug("Performing final sync before shutdown…", context = TAG)
    val synced = withTimeoutOrNull(2_000L) {
        ServiceQueue.LDK.background { node.syncWallets() }
    }
    if (synced != null) {
        Logger.debug("Final sync completed", context = TAG)
    } else {
        Logger.warn("Sync timed out, proceeding with shutdown", context = TAG)
    }
}.onFailure { Logger.warn("Final sync failed, proceeding with shutdown", it, context = TAG) }

This ensures node.stop() always has at least 3 seconds to execute properly.

@claude (bot) left a comment:

Code Review

I've reviewed this PR and found 2 significant issues that should be addressed before merging.

Comment on lines +407 to +411
// On cancellation (e.g., timeout), ensure state is recoverable
if (it is CancellationException) {
    Logger.warn("Node stop cancelled, forcing Stopped state for recovery", context = TAG)
    _lightningState.update { LightningState(nodeLifecycleState = NodeLifecycleState.Stopped) }
    return@withLock Result.failure(it)

Issue: Catching CancellationException without rethrowing breaks coroutine cancellation contract

This code catches CancellationException via runCatching and wraps it in Result.failure() without rethrowing. This violates Kotlin's coroutine cancellation contract, which requires CancellationException to always be rethrown for proper structured concurrency.

The same file demonstrates the correct pattern in multiple places:

  • Line 223: // Cancellation is expected during pull-to-refresh, rethrow per Kotlin best practices followed by if (it is CancellationException) throw it
  • Line 861: Same pattern

Impact: While this happens to work for the primary onDestroy call site (since withContext(bgDispatcher) may detect parent cancellation independently), stop() is also called from:

  • wipeStorage() (line 543)
  • restartWithElectrumServer() (line 562)
  • restartWithRgsServer() (line 588)
  • restartWithPreviousConfig() (line 609)
  • restartNode() (line 1076)

If any of these call sites' coroutines are cancelled while stop() is running, the CancellationException will be silently swallowed, and subsequent code will execute when it should not.

Suggested fix: Rethrow the CancellationException after performing the state recovery:

Suggested change

      // On cancellation (e.g., timeout), ensure state is recoverable
      if (it is CancellationException) {
          Logger.warn("Node stop cancelled, forcing Stopped state for recovery", context = TAG)
          _lightningState.update { LightningState(nodeLifecycleState = NodeLifecycleState.Stopped) }
-         return@withLock Result.failure(it)
+         throw it // Rethrow to properly propagate cancellation
      }

}.onFailure {
    // On cancellation (e.g., timeout), ensure state is recoverable
    if (it is CancellationException) {
        Logger.warn("Node stop cancelled, forcing Stopped state for recovery", context = TAG)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue: Setting state to Stopped when JNI operations may still be running

When a timeout-induced CancellationException occurs, this code forces the state to NodeLifecycleState.Stopped. However, the native node.syncWallets() and node.stop() JNI calls may still be executing on the ServiceQueue.LDK single-threaded dispatcher. JNI calls are not interruptible by coroutine cancellation - they will continue running until completion.

Consequences:

  1. State machine corruption: The lifecycle state says Stopped but the node is still actively performing I/O, persisting state, etc.

  2. Mutex release allows concurrent operations: When return@withLock executes, the lifecycleMutex is released. If start() is called next (e.g., on app relaunch), it will:

    • Acquire the mutex
    • See state Stopped
    • Proceed with startup
    • Find lightningService.node is non-null (since line 259 in LightningService.kt hasn't executed yet)
    • Skip setup() and dispatch node.start() to the LDK queue
    • The still-running stop() block eventually completes and sets the service's node reference to null
    • Result: Repo thinks node is Running, but service's node reference is null
  3. Recovery mechanism doesn't help: The stuck-Stopping recovery at lines 282-286 is never triggered in the timeout scenario because state was already forced to Stopped.

Possible solutions:

  1. Leave state as Stopping instead of forcing to Stopped - this allows the stuck-state recovery mechanism to handle it on next start() (a rough sketch follows this list)
  2. Add a flag in LightningService to track whether native operations are truly complete, independent of coroutine cancellation
  3. Rethink the timeout approach - consider whether the timeout should apply to the entire operation or just serve as a best-effort mechanism with acknowledgment that native cleanup continues in background
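
A minimal sketch of what option 1 could look like inside the same onFailure block, combined with the rethrow suggested in the previous comment. The surrounding code and exact state names are assumed from the snippets quoted above, not a concrete patch.

if (it is CancellationException) {
    // Keep the state at Stopping so the existing stuck-Stopping recovery
    // can clean up on the next start(); do not force Stopped while the
    // native node.stop() may still be running on the LDK queue.
    Logger.warn("Node stop cancelled; leaving state as Stopping for recovery", context = TAG)
    throw it // preserve the coroutine cancellation contract
}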



Development

Successfully merging this pull request may close these issues.

CJIT channel force closed while attempting LNURL withdrawals

3 participants